Information on data

The following data covers tornado damage to buildings in New Orleans during December 2022. It was obtained from Verisk Analytics and was derived using computer vision and machine learning applied to post-catastrophe aerial imagery. There are approximately 42,000 buildings in this dataset.


FEMA scores

Here is some general information on what each type of score means in the context of tornado roof damage:


Before and after

Here are some interactive before and after aerial images that were taken:

This is an example of a building that has a catastrophe score of 100 (FEMA 6 / Destroyed)


This is another example of a building that has a catastrophe score of 100 (FEMA 6 / Destroyed)


This is an example of a building that has a catastrophe score of approximately 60 (FEMA 4 / Major)

Clean data

I converted roofsolar into a logical (TRUE/FALSE) column by mapping “SOLAR PANEL” to TRUE and “NO SOLAR PANEL” to FALSE. I also set to NA the roof shapes that the computer wasn’t very sure about: those with a confidence score below 0.80, meaning more than a 20% chance of being incorrect. Some cells in damage_level held an empty string, so I converted those to NA as well. I then separated longitude and latitude into their own columns so that the coordinates could be read easily into leaflet.

library(dplyr)  # %>%, mutate(), case_when(), select()
library(tidyr)  # separate()

df <- read.csv("clean_data.csv") %>% 
  janitor::clean_names() %>% 
  # "SOLAR PANEL" becomes TRUE and "NO SOLAR PANEL" becomes FALSE
  mutate(roofsolar = case_when(
    roofsolar == "SOLAR PANEL" ~ TRUE,
    roofsolar == "NO SOLAR PANEL" ~ FALSE
  )) %>%
  # Roof shapes with a confidence score below 0.80 are set to NA
  mutate(roofshape = ifelse(roofshascr < 0.80, NA, roofshape)) %>%
  select(-c(roofshascr, roofcondit_discolordetect, roofcondit_discolorscore, roofcondit_discolorpercen, trampscr, roofcondit_tarppercen))

# Strip the "POINT (" prefix and ")" suffix so only "long lat" remains
df$rooftopgeo <- gsub("POINT \\(|\\)", "", df$rooftopgeo)

df <- df %>%
  separate(rooftopgeo, into = c("long", "lat"), sep = " ", convert = TRUE)

# Empty strings in damage_level are treated as missing
df$damage_level <- ifelse(df$damage_level == "", NA, df$damage_level)
df$roofshape <- factor(df$roofshape, levels = c("gable", "hip", "flat"))
# "gravel" is not among the levels, so gravel roofs become NA
levels_roofmateri <- c("metal", "shingle", "membrane", "shake", "tile")
df$roofmateri <- factor(df$roofmateri, levels = levels_roofmateri)
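The cleaning steps can be tried on a small made-up data frame without the original CSV. The sketch below uses base R only. The toy values and the name toy are assumptions for illustration, while the column names follow the code above.

```r
# Toy stand-in for the real data; every value here is made up.
toy <- data.frame(
  roofsolar  = c("SOLAR PANEL", "NO SOLAR PANEL", "NO SOLAR PANEL"),
  roofshape  = c("gable", "hip", "flat"),
  roofshascr = c(0.95, 0.60, 0.80),
  rooftopgeo = c("POINT (-90.05 29.95)", "POINT (-90.01 29.92)",
                 "POINT (-89.98 29.90)"),
  stringsAsFactors = FALSE
)

# A direct comparison yields TRUE or FALSE for both labels.
toy$roofsolar <- toy$roofsolar == "SOLAR PANEL"

# Shapes with a confidence score below 0.80 are treated as unknown.
toy$roofshape[toy$roofshascr < 0.80] <- NA

# Remove the "POINT (" prefix and ")" suffix, then split on the space.
coords <- gsub("POINT \\(|\\)", "", toy$rooftopgeo)
parts <- strsplit(coords, " ", fixed = TRUE)
toy$long <- as.numeric(vapply(parts, `[`, character(1), 1))
toy$lat  <- as.numeric(vapply(parts, `[`, character(1), 2))
toy$rooftopgeo <- NULL

toy
```

Note that a score of exactly 0.80 is kept, because the condition is a strict "less than".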

Catastrophe score is an aggregate of missing material, structural damage, and a few other attributes.


Building characteristics

Typically, insurance agencies consider gable roofs more prone to damage than hip roofs. This graph illustrates the proportion of roofs of each shape:

In addition, shingle roofs are more easily damaged than metal or tile roofs, which is important to keep in mind because shingle roofs are extremely common in the United States. The graph below illustrates the proportion of roofs made with each material (shingle roofs are clearly the most common):

Damage maps

Catastrophe scores are binned based on the summary of the nonzero catastrophe scores (the summary itself is shown in the Models section):

mostdamage <- df %>% filter(catastrophescore >= 50)
nodamage <- df %>% filter(catastrophescore == 0)
decimated <- df %>% filter(catastrophescore == 100)
middamage <- df %>% filter(catastrophescore < 50 & catastrophescore >= 15)
leastdamage <- df %>% filter(catastrophescore < 15 & catastrophescore >= 2)
minimaldamage <- df %>% filter(catastrophescore == 1)
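As an alternative to separate filters, the same bins could be assigned in one step with cut(). This is only a sketch on made-up scores. The label names are assumptions, and it assumes the scores are whole numbers, so that the interval [0, 1) contains only 0 and [1, 2) contains only 1.

```r
scores <- c(0, 1, 2, 14, 15, 49, 50, 100)  # made-up example scores

damage_bin <- cut(
  scores,
  breaks = c(0, 1, 2, 15, 50, Inf),
  labels = c("none", "minimal", "least", "mid", "most"),
  right = FALSE  # left-closed intervals: [0,1), [1,2), [2,15), [15,50), [50,Inf)
)

table(damage_bin)
```

Buildings with a score of 100 fall in the "most" bin here; the decimated subset would still need its own filter.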

NOTE: Red marks the most damaged buildings (catastrophe score >= 50), orange marks catastrophe scores above 25 and below 50, and blue marks nonzero catastrophe scores at or below 25. Only 3,852 buildings had a nonzero catastrophe score; the majority of the buildings (37,967) had a catastrophe score of 0 and are shown in gray.
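The colour rule in this note can be written as a small helper. This is a sketch only: score_colour is a hypothetical name, and the mapping code actually used for the maps is not shown in this report.

```r
# Map a catastrophe score to the colour described in the note.
score_colour <- function(score) {
  ifelse(score == 0, "gray",
    ifelse(score <= 25, "blue",
      ifelse(score < 50, "orange", "red")))
}

score_colour(c(0, 10, 25, 26, 49.9, 50, 100))
```

A vector produced this way could be passed to the color argument of leaflet's addCircleMarkers().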

All points

This shows all of the catastrophe scores. The vast majority of roofs have no damage and are shown in gray.


No damage

Map of the buildings that experienced no damage. These are all of the roofs that have a catastrophe score of 0:

Damage

Map of the buildings that experienced any form of damage:

NOTE: Red marks the most damaged buildings (catastrophe score at or above 50), orange marks catastrophe scores above 25 and below 50, and blue marks catastrophe scores at or below 25.


Let’s break down the damage:

Least damage

Map of the buildings that experienced the least damage. The catastrophe scores seen here are at least 2 and less than 15:

Mid damage

Map of the buildings that experienced mid-range damage. These catastrophe scores are at least 15 and less than 50:

Most damage

Map of the buildings that experienced the most damage. These buildings have catastrophe scores at or above 50: (Click the points!)

Destroyed

Map of the buildings that were completely destroyed. These have catastrophe scores of 100: (Click on the points here too!)

Models

Since most of the buildings in this dataset were not damaged by the tornado, the distribution of catastrophe scores is heavily skewed toward 0. This can be seen in the summary below:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.000   2.217   0.000 100.000
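The reason every quartile is 0 follows from the counts quoted in the Damage maps note: about 91% of the buildings have a score of 0, so the 75th percentile still falls among the zeros. The sketch below checks this with those two counts and a made-up toy vector (toy_scores is an assumption for illustration).

```r
n_zero    <- 37967  # buildings with a catastrophe score of 0
n_nonzero <- 3852   # buildings with a nonzero catastrophe score
share_zero <- n_zero / (n_zero + n_nonzero)
share_zero

# Toy scores: nine undamaged buildings and one damaged building.
toy_scores <- c(rep(0, 9), 60)
quantile(toy_scores, 0.75)           # still 0, because over 75% are zeros
summary(toy_scores[toy_scores > 0])  # summary of the damaged buildings only
```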

Check models: Extra

Because of this, I made models that excluded the catastrophe scores of 0 so that they describe only the structures that experienced damage. Below is the summary for the structures that exhibited damage:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.00    4.00   15.00   28.64   46.00  100.00

In addition, the models I made could not be expected to describe the data entirely, because the data itself was derived from aerial imagery using computer vision and machine learning. The computer vision models have inherent error rates, sometimes as high as 30% or 40%. Tornadoes are also inherently chaotic: they have a tendency to bounce around, which leads to seemingly random interactions with other substrates.

Models

## # Comparison of Model Performance Indices
## 
## Name  | Model |   AIC (weights) |  AICc (weights) |   BIC (weights) |    R2 |   RMSE |  Sigma
## ---------------------------------------------------------------------------------------------
## mods1 |   glm | 26272.9 (<.001) | 26272.9 (<.001) | 26314.3 (<.001) | 0.095 | 28.657 | 28.689
## mods2 |   glm | 26272.9 (<.001) | 26272.9 (<.001) | 26314.3 (<.001) | 0.095 | 28.657 | 28.689
## mods3 |   glm | 26094.1 (>.999) | 26094.1 (>.999) | 26147.3 (>.999) | 0.147 | 27.816 | 27.857
## mods4 |   glm | 30925.4 (<.001) | 30925.5 (<.001) | 30968.0 (<.001) | 0.156 | 28.795 | 28.821
## mods5 |   glm | 30773.9 (<.001) | 30773.9 (<.001) | 30828.6 (<.001) | 0.176 | 28.402 | 28.437

Out of the models I made, Model 3 appeared to work best, with the lowest AIC, BIC, and RMSE in the comparison above. It should be noted, though, that none of these models fit particularly well with the variables used.
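The comparison table resembles the output of compare_performance() from the performance package, although the call itself is not shown in this report. The same kind of comparison can be sketched in base R. The example below uses the built-in mtcars data as a stand-in, so the models and numbers are illustrative only and unrelated to the tornado data.

```r
# mtcars ships with R; it stands in for the tornado data here.
m_small <- glm(mpg ~ wt, family = gaussian(link = "identity"), data = mtcars)
m_large <- glm(mpg ~ wt + hp, family = gaussian(link = "identity"), data = mtcars)

# Root mean squared error of the response residuals.
rmse <- function(model) sqrt(mean(residuals(model, type = "response")^2))

comparison <- data.frame(
  model = c("m_small", "m_large"),
  AIC   = c(AIC(m_small), AIC(m_large)),
  BIC   = c(BIC(m_small), BIC(m_large)),
  RMSE  = c(rmse(m_small), rmse(m_large))
)

# Lower AIC, BIC, and RMSE indicate a better fit.
comparison[order(comparison$AIC), ]
```

AIC and BIC are only comparable between models fit to the same observations, so it is worth checking that each model used the same rows before ranking them.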

Model 3

check_model(mods3, type = "pearson")
Correlation plot between variables using Pearson correlation coefficient

  theme(text = element_text(size = 10))
summary(mods5)
## 
## Call:
## glm(formula = catastrophescore ~ long + roofmateri + rooftree + 
##     enclosure, family = gaussian(link = "identity"), data = extra)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -75.62  -18.17   -8.69   13.16   82.23  
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        4460.13497 1149.62714   3.880 0.000107 ***
## long                 49.03688   12.76374   3.842 0.000124 ***
## roofmaterishingle   -23.29005    1.43088 -16.277  < 2e-16 ***
## roofmaterimembrane   12.42319    2.18262   5.692 1.37e-08 ***
## roofmaterishake     -21.90493    4.73851  -4.623 3.94e-06 ***
## roofmateritile      -22.84290    7.71403  -2.961 0.003087 ** 
## rooftree              0.56774    0.06876   8.257  < 2e-16 ***
## enclosureTRUE        44.80193   10.78228   4.155 3.34e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 808.6766)
## 
##     Null deviance: 3157376  on 3226  degrees of freedom
## Residual deviance: 2603130  on 3219  degrees of freedom
##   (10 observations deleted due to missingness)
## AIC: 30774
## 
## Number of Fisher Scoring iterations: 2
vif(mods5)
##                GVIF Df GVIF^(1/(2*Df))
## long       1.010340  1        1.005157
## roofmateri 1.020779  4        1.002574
## rooftree   1.012753  1        1.006356
## enclosure  1.004155  1        1.002076
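Some of the figures printed above can be reproduced by hand, which helps when reading them. The sketch below retypes the printed values (so small rounding differences are expected) and recomputes the last column of the vif() output, along with an R2 and a residual standard deviation from the deviances in the summary() output. The variable names are assumptions for illustration.

```r
# Values copied from the vif() output above.
gvif <- c(long = 1.010340, roofmateri = 1.020779,
          rooftree = 1.012753, enclosure = 1.004155)
df_terms <- c(long = 1, roofmateri = 4, rooftree = 1, enclosure = 1)

# GVIF^(1/(2*Df)) puts terms with different degrees of freedom on one scale.
adjusted <- gvif^(1 / (2 * df_terms))
round(adjusted, 6)

# Values copied from the summary() output above.
null_dev  <- 3157376
resid_dev <- 2603130
resid_df  <- 3219

r2    <- 1 - resid_dev / null_dev     # share of deviance explained
sigma <- sqrt(resid_dev / resid_df)   # square root of the dispersion parameter
c(r2 = r2, sigma = sigma)
```

Adjusted values this close to 1 suggest very little collinearity among the predictors.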

Root mean squared error for Model 3

## [1] 27.81604

Root median squared error for Model 3

## [1] 17.34827
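The two error measures differ only in how the squared errors are summarised. A minimal sketch on made-up numbers follows; the function names rmse and rmedse are assumptions, not the names used to produce the values above.

```r
# Root mean squared error: sensitive to a few large misses.
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Root median squared error: reflects the typical miss, robust to outliers.
rmedse <- function(actual, predicted) sqrt(median((actual - predicted)^2))

actual    <- c(10, 20, 30)  # made-up values
predicted <- c(12, 20, 27)

rmse(actual, predicted)
rmedse(actual, predicted)
```

A root median squared error well below the root mean squared error, as reported above, indicates that a minority of buildings with large errors is pulling the mean upward.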

Predictions

Based on Model 3, I have made model predictions:

Here is a comparison of the predicted vs the actual catastrophe score:

I then plotted the predicted catastrophe scores alongside the actual catastrophe scores for reference.
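A predicted-versus-actual comparison of this kind can be sketched with predict(). The example below again uses the built-in mtcars data as a stand-in, so the model, variables, and numbers are illustrative assumptions and not the fitted tornado model.

```r
# Stand-in model on a dataset that ships with R.
fit <- glm(mpg ~ wt + hp, family = gaussian(link = "identity"), data = mtcars)

comparison <- data.frame(
  actual    = mtcars$mpg,
  predicted = predict(fit, newdata = mtcars, type = "response")
)
comparison$error <- comparison$actual - comparison$predicted
head(comparison)

# Points on the dashed 45-degree line are predicted exactly.
plot(comparison$actual, comparison$predicted,
     xlab = "Actual", ylab = "Predicted")
abline(a = 0, b = 1, lty = 2)
```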

Interpretations

The variables included in this dataset were not sufficient to predict catastrophe scores accurately, as the graph above shows. More information would need to be considered, specifically information about the tornadoes themselves.